Fix race condition causing 'ack timeout 4' disconnects #4
Conversation
The `AsyncClient::send()` method sets a boolean to true after pushing data over the TCP socket successfully using `tcp_output()`. It also sets a timestamp to remember at what time the data was sent.

The `AsyncClient::_sent()` callback method reacts to ACKs coming from the connected client. This method sets the boolean to false.

In the `AsyncClient::_poll()` method, a check is done to see if the boolean is true ("I'm waiting for an ACK") and if the time at which the data was sent is too long ago (5000 ms). If this is the case, a connection issue with the connected client is assumed and the connection is forcibly closed by the server.

The race condition occurs when these operations get mixed up because of multithreading behavior. The `_sent()` method can be called during the execution of the `send()` method:

1. `send()` sends out data using `tcp_output()`
2. `_sent()` is called because an ACK is processed, and sets the boolean to false
3. `send()` continues and sets the boolean to true + the timestamp to "now"

After this, the data exchange with the client was successful: data were sent and the ACK was seen. However, the boolean ended up as true, making the `_poll()` method think that an ACK is still to be expected. As a result, 5000 ms later, the connection is dropped.

This commit fixes the code by first registering that an ACK is expected, before calling `tcp_output()`. This way, there is no race condition when the ACK is processed right after that call.

Additionally, I changed the boolean to an integer counter value. The server might send multiple messages to the client, resulting in multiple expected ACKs. A boolean does not cover this situation.
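A minimal sketch of the fix described above, not the actual patch: `_waiting_acks` is an illustrative name for the counter (the real change may use a different field), while `_pcb_sent_at` and `_ack_timeout` are the existing fields referenced elsewhere in this thread.

```cpp
// Sketch only: the counter is bumped *before* tcp_output(), so an ACK that
// races in right after the call can only decrement it, never be overwritten.
bool AsyncClient::send() {
    _waiting_acks++;            // register the expected ACK first
    _pcb_sent_at = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    _waiting_acks--;            // nothing went out, so no ACK is expected
    return false;
}

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    if (_waiting_acks > 0) {
        _waiting_acks--;        // one outstanding ACK has been handled
    }
    // ... existing handling ...
    return ERR_OK;
}

int8_t AsyncClient::_poll(tcp_pcb* pcb) {
    // Only report an ACK timeout while ACKs are actually outstanding.
    if (_waiting_acks > 0 && (millis() - _pcb_sent_at) >= _ack_timeout) {
        // ... close the connection / call the timeout callback ...
    }
    return ERR_OK;
}
```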
Thanks! 🎉
@mmakaay what about ESPAsyncTCP?
I have looked at ESPAsyncTCP while I was tracking this issue, and it does use the same construction to keep track of ACKs. However, a lot has been done to that code, and I saw some very specific comments about possible crashes and how to prevent them. The two libraries looked so different in places that I could not decide right away if the same fix was needed. Somewhere in the back of my mind, I made a mental note about checking out the actual differences between these two libraries, because when looking at them, I wondered if the ESP32 AsyncTCP library wouldn't benefit from some of the developments in the ESPAsyncTCP library.
Hey @OttoWinter, based on the revert of the AsyncTCP version bump by @glmnet: do you agree it would be best to revert the patch from this pull request? It's no good if there are devices that show regression in the form of extra disconnects.
@mmakaay Agree, though it would make sense to understand how that regression could be caused too. Maybe we're encountering another race condition now because some code is executed on another ESP32 core in parallel (or interleaved execution), so it might make sense to use mutexes in this code.
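To make the mutex idea concrete, here is a minimal sketch (not library code) of how the ACK-tracking state could be guarded with a FreeRTOS mutex on the ESP32; the function and variable names are illustrative stand-ins for the library's `send()` / `_sent()` paths and the `_pcb_busy` / `_pcb_sent_at` fields.

```cpp
#include <freertos/FreeRTOS.h>
#include <freertos/semphr.h>

// Illustrative shared state (stand-ins for the library's fields).
static volatile bool pcb_busy = false;
static volatile uint32_t pcb_sent_at = 0;
static SemaphoreHandle_t ack_state_lock = nullptr;

void ack_state_init() {
    ack_state_lock = xSemaphoreCreateMutex();
}

// Called after data was handed to tcp_output() successfully (the send() path).
void mark_ack_expected(uint32_t now_ms) {
    xSemaphoreTake(ack_state_lock, portMAX_DELAY);
    pcb_busy = true;        // an ACK is now expected
    pcb_sent_at = now_ms;
    xSemaphoreGive(ack_state_lock);
}

// Called from the _sent() callback when an ACK arrives.
void mark_ack_received() {
    xSemaphoreTake(ack_state_lock, portMAX_DELAY);
    pcb_busy = false;       // the expected ACK has been seen
    xSemaphoreGive(ack_state_lock);
}
```

With both updates behind the same lock, the two paths cannot interleave their writes when they run on different cores.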
Yes, I think that mutexes might improve this. What I haven't looked into further so far is how this setup would work when actual ACKs go missing. Where are the retransmits for this arranged? Should lwIP handle these? The AsyncTCP ACK handling assumes a sunny-day scenario in which no ACKs get lost, and treats a lost ACK as a reason to kick the connection. That is not the best scenario to work with. Possibly the regression hits when actual packets are getting lost on the network? It shouldn't kill the connection in such a case though, since the whole reason for TCP is allowing for such issues. From a TCP perspective, the connection might be considered sub-optimal, but very much usable nonetheless. I can dive a bit deeper into the stack and see what lwIP actually does. If that layer does handle ACKs and retransmits, I'm not so sure that this ACK handling ought to be in AsyncTCP to begin with.
Yes, lwIP does handle retransmits :) IIRC the reason for it is cleanup purposes: with the ACK logic you know when it's safe to write to a buffer again (if the buffer wasn't copied to lwIP's internal one before). But I'm not 100% sure about that anymore.
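For context, a rough sketch of the lwIP raw-API side of this (not AsyncTCP code): the stack retransmits lost segments itself, and the `tcp_sent()` callback fires once the peer has ACKed data, which is mainly useful for knowing when an application buffer that was written without `TCP_WRITE_FLAG_COPY` can be reused.

```cpp
#include "lwip/tcp.h"  // lwIP raw TCP API

// lwIP calls this once the peer has ACKed `len` bytes of previously written
// data. Retransmission of lost segments is handled inside lwIP; this callback
// only tells the application that its data has been delivered.
static err_t on_sent(void* arg, struct tcp_pcb* tpcb, u16_t len) {
    // e.g. release `len` bytes from an application-side send queue
    return ERR_OK;
}

static void setup_pcb_callbacks(struct tcp_pcb* pcb) {
    tcp_arg(pcb, nullptr);     // user argument passed back to the callbacks
    tcp_sent(pcb, on_sent);    // register the "data was ACKed" callback
}
```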
I managed to reproduce this issue by introducing a delay in the send method:

```cpp
bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        vTaskDelay(50 / portTICK_RATE_MS);
        // The 50 is to give the network plenty of time to send the ack,
        // I tried with 25 and there were no timeouts
        _pcb_busy = true;
        _pcb_sent_at = millis();
        return true;
    }
    return false;
}
```

And fixed it with this (@mmakaay's ideas):

```cpp
bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    auto pcb_sent_at_back = _pcb_sent_at;
    auto pcb_busy_back = _pcb_busy;
    _pcb_sent_at = millis();
    _pcb_busy = true;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        return true;
    }
    _pcb_busy = pcb_busy_back;
    _pcb_sent_at = pcb_sent_at_back;
    return false;
}
```

I also marked them volatile:

```cpp
volatile bool _pcb_busy;
volatile uint32_t _pcb_sent_at;
```

`_pcb_busy` is there just to tell us that `_pcb_sent_at` has a valid value and that the ack timeout should be observed, so there is no harm at all; both values are backed up in case the send fails.
Was making them volatile required to make it work correctly?

Last night, I came up with a slightly different and possibly cleaner fix. I will explain my code below. Before I explain the code, here is a consideration that I included in this change: I renamed a few of the related fields to consistent `_tx_*` / `_rx_*` prefixes, to make the code easier to read. On to the code changes: this makes the block of related fields from the class header look like this:

```cpp
bool _ack_pcb;
uint32_t _tx_last_packet;
uint32_t _rx_ack_len;
uint32_t _rx_last_packet;
uint32_t _rx_since_timeout;
uint32_t _rx_last_ack;
uint32_t _ack_timeout;
uint16_t _connect_port;
```

A few side-remarks: I removed all uses of the `_pcb_busy` field.

A much simplified send method, which now registers the last transmitted packet if a send succeeds:

```cpp
bool AsyncClient::send(){
    auto backup = _tx_last_packet;
    _tx_last_packet = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    _tx_last_packet = backup;
    return false;
}
```

The method that handles an incoming ack now updates the `_rx_last_ack` timestamp:

```cpp
int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    _rx_last_packet = millis();
    _rx_last_ack = millis();
    // ... snip
}
```

The logic for the ACK timeouts now checks if the last incoming ack time is before the last sent packet time. That eliminates the need for a boolean to enable the ACK timeout check.

```cpp
int8_t AsyncClient::_poll(tcp_pcb* pcb){
    // ... snip
    // ACK Timeout
    if(_ack_timeout && _tx_last_packet > _rx_last_ack && (now - _tx_last_packet) >= _ack_timeout){
        log_w("ack timeout %d", pcb->state);
        if(_timeout_cb)
            _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
        return ERR_OK;
    }
    // ... snip
}
```

This code is giving me very stable connections. I have 8 OTA loggers and HA active, and they now run as stably for me as when I fully disable the ACK timeout checks. Do you want me to bundle this up as a new pull request?
The volatile was added just by instinct. Regarding your code, you should check your timestamp comparison: it should check for rollovers, see https://arduino.stackexchange.com/questions/12587/how-can-i-handle-the-millis-rollover. Otherwise looks good to me.
Good point. I was wondering if it would be worth it, but let's prevent that possible once-per-50-days disconnect issue 👍

```cpp
if(_ack_timeout){
    uint32_t one_day = 86400000;
    bool last_tx_is_after_last_ack = (_rx_last_ack - _tx_last_packet + one_day) < one_day;
    if(last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
        log_w("ack timeout %d", pcb->state);
        if(_timeout_cb)
            _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
        return ERR_OK;
    }
}
```
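For reference, a common alternative idiom for rollover-safe `millis()` comparisons is interpreting the unsigned difference as a signed value; this is only a sketch of that idiom, not the code that went into the follow-up pull request.

```cpp
#include <stdint.h>

// With unsigned 32-bit timestamps, (a - b) wraps correctly, so the signed
// interpretation of the difference tells us which event is more recent,
// as long as the two timestamps are less than ~24.8 days apart.
static inline bool is_after(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) > 0;
}

// The timeout check could then be written as:
//   if (_ack_timeout && is_after(_tx_last_packet, _rx_last_ack)
//       && (now - _tx_last_packet) >= _ack_timeout) { ... }
```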
Okay, a follow-up pull request was created at #5. I did not include the volatile flags, because I do not think they would be useful.
After my first attempt at fixing the client disconnects (OttoWinter#4) got merged into AsyncTCP, it turned out that there was a regression for some devices: the connection stability actually went down instead of up. After a lot of debugging and discussion with @glmnet (some of the results can be found in the above pull request discussion), we came up with an improved fix for the disconnect issues.

**Changed:** The code that checks for ACK timeouts has been simplified in such a way that only two timestamps are now used to determine whether an ACK timeout has happened: the time of the last sent packet (this was already recorded) and the time of the last ACK received from the client (this has been added). Using these timestamps, there is no more need for a separate field to keep track of whether we are waiting for an ACK or not (`_pcb_busy`). Therefore, this field was completely removed from the code. While I was at it, I renamed a few variables to make the code easier to read and more consistent.

**Results:** I connected Home Assistant plus 8 OTA loggers at the same time, using very verbose logging output. This normally was an easy way to trigger the disconnect errors. It turned out that this solution runs as solidly for me as when disabling the ACK timeout checks completely (using `AsyncClient::setAckTimeout(0)`).

(cherry picked from commit 6f37967)
A pull request was opened, as requested by @OttoWinter in esphome/issues#1921 (comment).